BioMart Central Portal is a first-of-its-kind, community-driven effort to provide unified access to dozens of biological databases spanning genomics, proteomics, model organisms, cancer data, ontology information and more. Anybody can contribute an independently maintained resource to the Central Portal, allowing it to be exposed to and shared with the research community, and linking it with the other resources in the portal. Users can take advantage of the common interface to quickly utilize different sources without learning a new system for each. The system also simplifies cross-database searches that might otherwise require several complicated steps. Several integrated tools streamline common tasks, such as converting between ID formats and retrieving sequences. The combination of a wide variety of databases, an easy-to-use interface, robust programmatic access and the array of tools makes Central Portal a one-stop shop for biological data querying. Here, we describe the structure of Central Portal and show example queries to demonstrate its capabilities.

Database URL: http://central.biomart.org

Figure 1. Databases available on the BioMart Central Portal and their host countries (April 2011).



Project description 

Introduction 

BioMart is a free, open-source, federated database system (1-3). It is cross-platform and supports many popular relational database management systems, including MySQL, Oracle, PostgreSQL, SQL Server and DB2. The software is data-agnostic, and can therefore be easily adapted to existing data sets. It is expandable and customizable through a plug-in system, and is open-source so the community can participate in deeper development. Furthermore, BioMart can seamlessly connect geographically disparate databases, facilitating collaboration between different groups. These features have catalyzed the creation of BioMart Central Portal, a first-of-its-kind community-supported effort to create a single access point integrating many different, independently administered biological databases (Figure 1).

For administrators, participation in Central Portal offers several benefits. Central Portal can provide an instantly available and automatically updated source of annotations for other projects, as is done in the International Cancer Genome Consortium Data Portal (4). Being part of the community can also expose a database to a wide user base. Furthermore, because the BioMart software allows administrators to easily create their own plug-ins, joining the community allows administrators to take advantage of the tools that others have created, thereby enhancing their own databases. Central Portal passes queries directly to the individual member servers, so administrators retain full control of their databases and their data (Figure 2).

For users, Central Portal offers a central repository for a vast array of biological data. BioMart can interoperate with other web sites, because results can be configured to link to outside resources; examples in Central Portal include KEGG pathway information (5-7) and Pancreatic Expression Database entries (8). The intuitive interface is consistent across all databases, so users familiar with one source can immediately transfer their skills to another data source. Since Central Portal is constantly updated, users are immediately exposed to new resources as they become available. In addition to the web-based interface, Central Portal offers a variety of other access methods for more advanced querying, including application programming interfaces (APIs) for Java, SPARQL, REST and SOAP.

Moreover, both users and administrators benefit from the value gained by having individual databases connected in a central access point. By allowing data sets to be linked together, resources can be combined in novel ways, potentially revealing unexpected connections or suggesting new avenues of inquiry. The strength of the Central Portal comes from the fact that it is created and supported by a large community, and, as a whole, it is greater than the sum of its parts.

Figure 2. Each individual server hosts its own instance of BioMart retrieving data from its own local database backend. Central Portal offers a unified access point to all of these databases, distributing queries to the appropriate servers.

Interface 

When viewing the Central Portal home page, users are presented with the main querying section, which is divided into three subsections: Identifier Search, Tools and Database Search (Figure 3).

The Identifier Search (Figure 3A) allows users to input gene identifiers in a number of formats (e.g. gene names, Ensembl IDs and RefSeq IDs) and search for them across all of the member databases in the Portal. The result of the search links to a report page for the identifier, which summarizes key information about the search term taken from several sources (Figure 4). With this function users can quickly find information about a single identifier, and perhaps even locate resources that they did not realize were applicable to the target of their query.



The Tools section (Figure 3B) contains links to various data analysis tools in four categories: Gene retrieval, Variant retrieval, Sequence retrieval and ID Converter. The first two sections allow quick access to some of the largest and most popular databases contained in Central Portal. The third section, Sequence retrieval, allows easy querying of genomic and protein sequences in any of several formats (Figure 5). The fourth section, the ID Converter tool, allows users to enter or upload a list of identifiers in any format supported by a BioMart database, and retrieve the same list converted to any other supported format.
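
The idea behind the ID Converter can be sketched as a lookup from one identifier format to another. The following Python sketch is only an illustration: the function `convert_ids`, the field names and the in-memory table are assumptions made here, not part of the BioMart software, which resolves such mappings against the member databases. The table contains a single record holding well-known identifiers for EGFR.

```python
from typing import Dict, List, Optional

# Toy cross-reference table: one record per gene, keyed by identifier format.
# Field names are hypothetical; a real converter would query member databases.
XREFS: List[Dict[str, str]] = [
    {
        "ensembl_gene_id": "ENSG00000146648",
        "hgnc_symbol": "EGFR",
        "uniprot_accession": "P00533",
        "refseq_dna_id": "NM_005228",
    },
]


def convert_ids(ids: List[str], source: str, target: str) -> Dict[str, Optional[str]]:
    """Map each identifier from the source format to the target format.

    Identifiers with no known mapping are returned with a value of None.
    """
    # Build an index from source-format identifier to target-format identifier.
    index = {rec[source]: rec.get(target) for rec in XREFS if source in rec}
    return {identifier: index.get(identifier) for identifier in ids}


if __name__ == "__main__":
    print(convert_ids(["EGFR", "TP53"], "hgnc_symbol", "ensembl_gene_id"))
```

Returning None for unmapped identifiers, rather than dropping them, keeps the output aligned with the uploaded list, which may be convenient when the converted list is inspected by hand.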

In the Database Search section (Figure 3C), users can access the individual member databases for querying through the BioMart interface. To make finding the relevant database easier, users can choose to browse databases by the type of information contained therein (Search by type) or by the organism with which the database is concerned (Search by organism). Search by type is further subdivided into several categories such as Genome [e.g. Ensembl databases (9)], Gene annotation [e.g. HGNC (10)], Protein sequence and structure [e.g. InterPro (11)], Interactions and pathways [e.g. Reactome (12)], Gene expression [e.g. EMAGE (13)], Cancer [e.g. COSMIC (14)] and Model organism databases [e.g. Gramene (15)]. Search by organism is subdivided into categories for bacteria, plants, protists, invertebrates and vertebrates. After choosing a data set, users can construct queries using the basic BioMart concepts of attributes, which indicate what information should be returned, and filters, which restrict the database entries that are retrieved.
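
The roles of attributes and filters can be illustrated with a small in-memory sketch. This is not BioMart code: the function `run_query`, the column names and the three rows of data are invented for illustration only. Filters select which entries are kept, and attributes select which columns of those entries are returned.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]

# Invented rows standing in for entries of a data set.
ROWS: List[Row] = [
    {"gene_id": "G1", "symbol": "GENE_A", "chromosome": "1", "biotype": "protein_coding"},
    {"gene_id": "G2", "symbol": "GENE_B", "chromosome": "1", "biotype": "pseudogene"},
    {"gene_id": "G3", "symbol": "GENE_C", "chromosome": "2", "biotype": "protein_coding"},
]


def run_query(rows: Iterable[Row], filters: Dict[str, str], attributes: List[str]) -> List[Row]:
    """Keep rows that satisfy every filter, then return only the requested attributes."""
    # Filters restrict which entries are retrieved.
    selected = [r for r in rows if all(r.get(k) == v for k, v in filters.items())]
    # Attributes decide what information is returned for each entry.
    return [{a: r.get(a, "") for a in attributes} for r in selected]


if __name__ == "__main__":
    result = run_query(
        ROWS,
        filters={"chromosome": "1", "biotype": "protein_coding"},
        attributes=["gene_id", "symbol"],
    )
    print(result)
```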



Access methods 

In addition to the graphical user interfaces, Central Portal offers programmatic access to allow for automated querying. Several programming interfaces are available: an XML querying method that can be accessed via REST or SOAP requests, a full Java API and RDF querying via SPARQL. The syntax of each API is easy to use for programmers familiar with the basic BioMart concepts of attributes, filters and data sets. For example, to retrieve a list of filters for a given data set, a client could use the REST API and access the URL /martservice/filters?datasets=datasetname. Equivalently, the client could call the Java API method getFilters(datasetname). Because a variety of APIs are available, developers can choose the access method that makes the most sense for their specific applications and use cases.
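
As a minimal sketch of the REST example above, the following Python code assembles the filters URL from the Database URL given in the abstract and the path quoted in the text. The helper names `filters_url` and `fetch_filters` are assumptions made for this illustration. The format of the response, and whether the service is reachable at that address, are not stated here, so `fetch_filters` simply returns the raw response text and is not called by default.

```python
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

BASE_URL = "http://central.biomart.org"  # Database URL given in the abstract
FILTERS_PATH = "/martservice/filters"    # REST path quoted in the text


def filters_url(dataset: str, base_url: str = BASE_URL) -> str:
    """Build the REST URL that lists the filters of one data set."""
    query = urlencode({"datasets": dataset})
    return urljoin(base_url, FILTERS_PATH) + "?" + query


def fetch_filters(dataset: str, timeout: float = 30.0) -> str:
    """Request the filter list and return the raw response body as text.

    The response format is not assumed; parsing is left to the caller.
    """
    with urlopen(filters_url(dataset), timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Only the URL is printed; no network request is made here.
    print(filters_url("datasetname"))
```

Keeping URL construction separate from the network call makes the request easy to inspect or log before it is sent.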

To further ease the adoption of the APIs, the equivalent code of any query constructed in the web GUI can be retrieved in any of the API formats by clicking on the appropriate button on the query page; in this way, queries can be saved, modified and easily transferred from one format to another. It also provides a readily available graphical method of constructing complex API calls, which could be of use in certain tools or scripts.



Figure 3. The BioMart Central Portal home page. Three main entry points are available: (A) Identifier search, (B) Tools and (C) Database search.

Data content

BioMart Central Portal contains a constantly growing list of data sources accessible by a wide variety of methods and tools. The following table reflects the contents of the portal as of May 2011:

| Database | Location | Description | References |
| --- | --- | --- | --- |
| Cildb | CNRS, France | Database for eukaryotic cilia and centriolar structures, integrating orthology relationships for 33 species with high-throughput studies and OMIM | (16) |
| COSMIC | WTSI, UK | Somatic mutation information relating to human cancers | (14) |
| EMAGE | MRC HGU, UK | In situ gene expression data in the mouse embryo | (13) |
| EMMA | EBI, UK | Mouse mutant strain information | (17) |
| Ensembl | WTSI/EBI, UK | Genome databases for vertebrates and other eukaryotic species | (9) |
| Ensembl Bacteria | EBI, UK | Genome databases for bacteria | (9) |
| Ensembl Fungi | EBI, UK | Genome databases for fungi | (9) |
| Ensembl Metazoa | EBI, UK | Genome databases for metazoa | (9) |
| Ensembl Plants | EBI, UK | Genome databases for plants | |
| Eurexpress | MRC HGU, UK | Transcriptome atlas database for mouse embryo | (18) |
| EuroPhenome | MRC Harwell, UK | Mouse phenotyping data | (19) |
| GermOnline | Inserm, France | Cross-species microarray expression database focusing on germline development, meiosis and gametogenesis as well as the mitotic cell cycle | (20) |
| Gramene | CSHL, USA | Agriculturally important grass genomes | (15) |
| HapMap | NCBI, USA | Multi-country effort to identify and catalog genetic similarities and differences in human beings | (21) |
| HGNC | EBI, UK | Repository of human gene nomenclature and associated resources | (10) |
| IKMC | WTSI, UK | Data on mutant products (mice, ES cells and vectors) generated and made available by members of the International Knockout Mouse Consortium | (22) |
| InterPro | EBI, UK | Integrated database of predictive protein 'signatures' used for the classification and automatic annotation of proteins and genomes | (11) |
| IntOGen | UPF, Spain | Integrated multi-dimensional data for the identification of genes and groups of genes involved in cancer development | (23) |
| KazusaMart | Kazusa, Japan | Cyanobase, rhizobia and plant genome databases | (24) |
| MGI | Jackson Laboratory, USA | Mouse genome features, locations, alleles and orthologues | (25) |
| | Barts Cancer | Results from published pancreatic cancer papers | |
| | CNRS, France | Paramecium genome database | (27) |
| Phytozome | JGI/CIG, USA | Comparative genomics of green plants | (28) |
| Potato Database | CIP, Peru | Potato and sweet potato phenotypic and genomic information | (29) |
| PRIDE | EBI, UK | Repository for protein and peptide identifications | (30) |
| Reactome | OICR, Canada; EBI, UK; NYU Medical Center, USA | Curated pathway annotation database | (12) |
| SDxMart | UCLA, USA | Saliva diagnostics for high-impact human diseases | (33) |
| sigReannot | Rennes, France | Aquaculture and farm animal species EST contigs | (34) |
| UniProt | EBI, UK | Protein sequence and functional information | (35) |
| VectorBase | University of Notre Dame, USA | Genome information for invertebrate vectors of human pathogens | (36) |
| VEGA | WTSI, UK | Manual annotation of vertebrate genome sequences | (37) |
| WormBase | California Institute of Technology, USA; CSHL, USA; EBI, UK; Washington University, USA | Caenorhabditis elegans and related nematode genomic information | (38) |
| WTSI Mouse Genetics | WTSI, UK | Mouse phenotyping and expression data captured from mutant mouse lines | (39) |

Database, Vol. 2011, Article ID bar041, doi:10.1093/database/bar041

Figure 4. The Gene Report page for EGFR, displaying data federated from several sources.

Figure 5. The sequence retrieval plug-in page.

Query examples

One of the great strengths of Central Portal is that it allows cross-database searches that no individual resource could support on its own. The following examples illustrate the possibilities afforded by this feature.

Query #1: 'Find insertion-frameshift mutations in the 
COSMIC database that affect genes involved in Apoptosis'. 



| Entry point | Filters |
| --- | --- |
| Gene retrieval > cancer genes | COSMIC: Mutation type-AA: Insertion-frameshift; KEGG: KEGG Pathway: apoptosis |



By integrating data from the COSMIC and KEGG databases, 
Central Portal allows users to identify COSMIC mutations 
specific to their pathways of interest. The Pathway title 
links back to the KEGG web site and mutation ID links 
back to the COSMIC web site, providing the ability to 
obtain more detailed information on the pathway or on 
the mutation, respectively. 
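
The logic of Query #1 amounts to joining two sources on a shared gene key. The sketch below illustrates this with invented data: the mutation IDs, gene names, field names and the function `mutations_in_pathway` are all hypothetical, and the code does not reflect how Central Portal federates the COSMIC and KEGG servers internally.

```python
from typing import Dict, List, Set

# Invented records standing in for COSMIC-style mutation entries.
MUTATIONS: List[Dict[str, str]] = [
    {"mutation_id": "M1", "gene": "GENE_A", "type": "Insertion-frameshift"},
    {"mutation_id": "M2", "gene": "GENE_B", "type": "Substitution-missense"},
    {"mutation_id": "M3", "gene": "GENE_C", "type": "Insertion-frameshift"},
]

# Invented pathway membership standing in for KEGG-style annotation.
PATHWAYS: Dict[str, Set[str]] = {
    "apoptosis": {"GENE_A", "GENE_B"},
    "cell cycle": {"GENE_C"},
}


def mutations_in_pathway(
    mutations: List[Dict[str, str]],
    pathways: Dict[str, Set[str]],
    mutation_type: str,
    pathway: str,
) -> List[str]:
    """Return IDs of mutations of the given type in genes of the given pathway."""
    genes = pathways.get(pathway, set())
    return [
        m["mutation_id"]
        for m in mutations
        if m["type"] == mutation_type and m["gene"] in genes
    ]


if __name__ == "__main__":
    print(mutations_in_pathway(MUTATIONS, PATHWAYS, "Insertion-frameshift", "apoptosis"))
```

With the toy data above, only M1 satisfies both filters: M2 is in the pathway but of a different mutation type, and M3 is a frameshift insertion in a gene outside the pathway.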

Query #2: 'Retrieve the cDNA sequences of protein-coding human genes that have HGNC IDs' (Figure 5).

| Entry point | Data sets | Filters/attributes |
| --- | --- | --- |
| Sequence retrieval > Ensembl | Homo sapiens gene (GRCh37.p2) | Sequences: cDNA sequences; Filters: Limit to genes: with HGNC ID(s), Type: protein_coding; Header information: Ensembl Gene ID, Ensembl Transcript ID |



By combining the sequence retrieval tool with search capabilities, BioMart reduces what is often a two-step process (retrieving a list of genes, and then retrieving the sequences of those genes) to a single query.



Future directions 

BioMart Central Portal is constantly evolving thanks to the efforts of the community that supports it and contributes data. To make joining Central Portal easier, we are creating BioMart Central Registry. With this resource, database administrators will be able to create an account, add their data sources and suggest categorization for them. Once registered, participants will also be able to make changes to their databases and notify Central Portal of updates.



In addition to including new data sets, Central Portal will evolve as new tools are developed and added. Such tools will perform deeper analysis, such as detecting enrichment of certain properties (e.g. GO terms) within a given set of genes or calculating consequences given a list of SNP terms. BioMart plug-ins developed by other community members may also be incorporated, further strengthening the project as a whole.